An introduction to the Online Retail II dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/machine-learning-databases/00502/).

The data contains transactions occurring for a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2010.

Converting the file to RDS for a more lightweight, faster-loading format.

# Run once: read the Excel source and cache it as an RDS file
#data <- readxl::read_xlsx("../online_retail_II.xlsx")

#saveRDS(data, file = "online_retail_II.rds")

Reading data.
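A minimal loading sketch, assuming the cached RDS file from the step above exists; the fallback to the Excel source is an assumption about the project layout.

```r
# Prefer the cached RDS; fall back to the original Excel export
if (file.exists("online_retail_II.rds")) {
  data <- readRDS("online_retail_II.rds")
} else {
  data <- readxl::read_xlsx("../online_retail_II.xlsx")
}
```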

colnames(data)
## [1] "Invoice"     "StockCode"   "Description" "Quantity"    "InvoiceDate"
## [6] "Price"       "Customer ID" "Country"

InvoiceNo: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter ‘c’, it indicates a cancellation.

StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product (in practice some codes carry a letter suffix, e.g. 85123A).

Description: Product (item) name. Nominal.

Quantity: The quantities of each product (item) per transaction. Numeric.

InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated.

UnitPrice: Unit price. Numeric. Product price per unit in sterling (£).

CustomerID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.

Country: Country name. Nominal. The name of the country where a customer resides.

# some basic cleaning and factoring these columns for easier visualization of ratios
data <- data %>% clean_names()
data$description <- as.factor(data$description)
data$description <- fct_explicit_na(data$description)
data$country <- as.factor(data$country)
data$customer_id <- fct_explicit_na(as.factor(data$customer_id))
data$stock_code <- fct_explicit_na(as.factor(data$stock_code))


summary(data)
##    invoice            stock_code    
##  Length:525461      85123A :  3516  
##  Class :character   22423  :  2221  
##  Mode  :character   85099B :  2057  
##                     21212  :  1933  
##                     21232  :  1843  
##                     20725  :  1620  
##                     (Other):512271  
##                              description        quantity       
##  WHITE HANGING HEART T-LIGHT HOLDER:  3549   Min.   :-9600.00  
##  (Missing)                         :  2928   1st Qu.:    1.00  
##  REGENCY CAKESTAND 3 TIER          :  2212   Median :    3.00  
##  STRAWBERRY CERAMIC TRINKET BOX    :  1843   Mean   :   10.34  
##  PACK OF 72 RETRO SPOT CAKE CASES  :  1466   3rd Qu.:   10.00  
##  ASSORTED COLOUR BIRD ORNAMENT     :  1457   Max.   :19152.00  
##  (Other)                           :512006                     
##   invoice_date                     price              customer_id    
##  Min.   :2009-12-01 07:45:00   Min.   :-53594.36   (Missing):107927  
##  1st Qu.:2010-03-21 12:20:00   1st Qu.:     1.25   14911    :  5710  
##  Median :2010-07-06 09:51:00   Median :     2.10   17841    :  5114  
##  Mean   :2010-06-28 11:37:36   Mean   :     4.69   14606    :  3927  
##  3rd Qu.:2010-10-15 12:45:00   3rd Qu.:     4.21   14156    :  2710  
##  Max.   :2010-12-09 20:01:00   Max.   : 25111.09   12748    :  2665  
##                                                    (Other)  :397408  
##            country      
##  United Kingdom:485852  
##  EIRE          :  9670  
##  Germany       :  8129  
##  France        :  5772  
##  Netherlands   :  2769  
##  Spain         :  1278  
##  (Other)       : 11991
tail(data)
## # A tibble: 6 x 8
##   invoice stock_code description quantity invoice_date        price customer_id
##   <chr>   <fct>      <fct>          <dbl> <dttm>              <dbl> <fct>      
## 1 538171  20971      PINK BLUE …        2 2010-12-09 20:01:00  1.25 17530      
## 2 538171  22271      FELTCRAFT …        2 2010-12-09 20:01:00  2.95 17530      
## 3 538171  22750      FELTCRAFT …        1 2010-12-09 20:01:00  3.75 17530      
## 4 538171  22751      FELTCRAFT …        1 2010-12-09 20:01:00  3.75 17530      
## 5 538171  20970      PINK FLORA…        2 2010-12-09 20:01:00  3.75 17530      
## 6 538171  21931      JUMBO STOR…        2 2010-12-09 20:01:00  1.95 17530      
## # … with 1 more variable: country <fct>
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    525461 obs. of  8 variables:
##  $ invoice     : chr  "489434" "489434" "489434" "489434" ...
##  $ stock_code  : Factor w/ 4631 levels "10002","10002R",..: 3892 3052 3054 1394 709 1417 1278 972 1690 1689 ...
##  $ description : Factor w/ 4644 levels "*Boombox Ipod Classic",..: 31 2943 4431 3259 4107 2963 3618 1492 829 1230 ...
##  $ quantity    : num  12 12 12 48 24 24 24 10 12 12 ...
##  $ invoice_date: POSIXct, format: "2009-12-01 07:45:00" "2009-12-01 07:45:00" ...
##  $ price       : num  6.95 6.75 6.75 2.1 1.25 1.65 1.25 5.95 2.55 3.75 ...
##  $ customer_id : Factor w/ 4384 levels "12346","12347",..: 508 508 508 508 508 508 508 508 508 508 ...
##  $ country     : Factor w/ 40 levels "Australia","Austria",..: 37 37 37 37 37 37 37 37 37 37 ...
# data explorer report
#DataExplorer::create_report(data)
# quantity can be negative (these rows are cancellations), so it is a good idea
# to create a total price value that multiplies quantity * price
data %>% filter(startsWith(invoice,"C")) %>% filter(stock_code == "22087")
## # A tibble: 3 x 8
##   invoice stock_code description quantity invoice_date        price customer_id
##   <chr>   <fct>      <fct>          <dbl> <dttm>              <dbl> <fct>      
## 1 C489449 22087      PAPER BUNT…      -12 2009-12-01 10:33:00  2.95 16321      
## 2 C533899 22087      PAPER BUNT…       -6 2010-11-19 11:50:00  2.95 13993      
## 3 C536164 22087      PAPER BUNT…      -65 2010-11-30 12:09:00  2.9  15369      
## # … with 1 more variable: country <fct>
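The share of cancellations can be checked with a quick sketch; this counts distinct invoices by their "C" prefix rather than invoice lines.

```r
# Count distinct cancellation invoices (prefix "C") vs. regular invoices
data %>%
  distinct(invoice) %>%
  mutate(cancelled = startsWith(invoice, "C")) %>%
  count(cancelled)
```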
data <- data %>% mutate(total_price = quantity * price)

Comparing the daily turnover of the online retail platform in December 2009 and December 2010.

What we can see here is that:

- in December 2009, there were sales up until the 23rd of December
- in December 2010, sales were made only until the 9th of December, which suggests data may be missing from the last days of 2010
- we can also deduce that this online retailer stops operating from Christmas until January.
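The decembercomparison2009 and decembercomparison2010 objects are not built in this excerpt; a plausible construction (an assumption, using an xts daily-sales series, which TSstudio::ts_plot accepts) could be:

```r
# Hypothetical construction of the daily December 2009 sales series;
# decembercomparison2010 would be built the same way with year == 2010
dec2009_daily <- data %>%
  filter(year(invoice_date) == 2009, month(invoice_date) == 12) %>%
  group_by(date = as.Date(invoice_date)) %>%
  summarise(total_sales = sum(total_price))
decembercomparison2009 <- xts::xts(dec2009_daily$total_sales,
                                   order.by = dec2009_daily$date)
```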

ts_plot(decembercomparison2009, Ytitle = "Total Sales", line.mode = "lines+markers")
ts_plot(decembercomparison2010, Ytitle = "Total Sales", line.mode = "lines+markers")

Visualizing distribution of daily purchases

Here we see the amount of different items per invoice per day for December 2009.

fntltp <- JS("function(){
  return this.point.x + ' ' +  this.series.yAxis.categories[this.point.y] + ':<br>' +
  Highcharts.numberFormat(this.point.value, 2);
}")


data %>%
  filter(month(invoice_date) == 12 & year(invoice_date) == 2009) %>%
  mutate(day = lubridate::day(invoice_date), year = year(invoice_date)) %>%
  group_by(invoice, as.factor(day)) %>%
  mutate(sales_per_invoice = sum(total_price)) %>%
  ungroup() %>%
  group_by(day) %>%
  count(invoice) %>%
  arrange(day, -n) %>%
  ungroup() %>%
  hchart("heatmap", hcaes(x = day, y = invoice, value = n)) %>% 
  hc_yAxis(reversed = TRUE, offset = -20, tickLength = 0,
           gridLineWidth = 0, minorGridLineWidth = 0,
           labels = list(style = list(fontSize = "8px"))) %>%  
  hc_tooltip(formatter = fntltp) %>% 
  hc_size(height = 1000)

Here we see the amount of different items per invoice per day for December 2010.

data %>%
  filter(month(invoice_date) == 12 & year(invoice_date) == 2010) %>%
  mutate(day = lubridate::day(invoice_date), year = year(invoice_date)) %>%
  group_by(invoice, as.factor(day)) %>%
  mutate(sales_per_invoice = sum(total_price)) %>%
  ungroup() %>%
  group_by(day) %>%
  count(invoice) %>%
  arrange(day, -n) %>%
  ungroup() %>%
  hchart("heatmap", hcaes(x = day, y = invoice, value = n)) %>% 
  hc_yAxis(reversed = TRUE, offset = -20, tickLength = 0,
           gridLineWidth = 0, minorGridLineWidth = 0,
           labels = list(style = list(fontSize = "8px"))) %>%  
  hc_tooltip(formatter = fntltp) %>% 
  hc_size(height = 1000)

Top 20 highest-spending customers for both Decembers (2009 and 2010).

data %>%
  filter(month(invoice_date) == 12 & year(invoice_date) %in% c(2009, 2010)) %>%
  mutate(day = lubridate::day(invoice_date), year = year(invoice_date)) %>%
  group_by(customer_id, as.factor(day), year) %>%
  summarise(sales_per_invoice = sum(total_price)) %>%
  ungroup() %>%
  arrange(-sales_per_invoice) %>%
  head(20)
## # A tibble: 20 x 4
##    customer_id `as.factor(day)`  year sales_per_invoice
##    <fct>       <fct>            <dbl>             <dbl>
##  1 18102       7                 2010            25920.
##  2 (Missing)   6                 2010            23395.
##  3 (Missing)   3                 2010            23022 
##  4 (Missing)   14                2009            21913.
##  5 18102       3                 2009            18475 
##  6 (Missing)   22                2009            17307.
##  7 (Missing)   16                2009            15618.
##  8 (Missing)   9                 2010            15354.
##  9 13694       14                2009            14190.
## 10 (Missing)   1                 2010            12584.
## 11 (Missing)   7                 2009            11914.
## 12 18102       11                2009            11016 
## 13 (Missing)   18                2009            10895.
## 14 (Missing)   1                 2009            10465.
## 15 (Missing)   2                 2009            10185.
## 16 (Missing)   9                 2009             9502.
## 17 15061       2                 2010             9407.
## 18 14646       18                2009             8370.
## 19 14156       21                2009             7113.
## 20 (Missing)   4                 2009             6660.

Top 10 most frequently ordered items (by number of invoice lines) across the two Decembers.

data %>%
  filter(month(invoice_date) == 12 & year(invoice_date) %in% c(2009, 2010)) %>%
  mutate(day = lubridate::day(invoice_date), year = year(invoice_date)) %>%
  group_by(invoice, as.factor(day)) %>%
  mutate(sales_per_invoice = sum(total_price)) %>%
  ungroup() %>%
  count(description) %>%
  arrange(-n) %>%
  head(10) %>%
  hchart(type = "bar", hcaes(x = description, y = n))

Top 20 most frequently ordered items in the whole dataset.

head(summary(data$description), 20)
## WHITE HANGING HEART T-LIGHT HOLDER                          (Missing) 
##                               3549                               2928 
##           REGENCY CAKESTAND 3 TIER     STRAWBERRY CERAMIC TRINKET BOX 
##                               2212                               1843 
##   PACK OF 72 RETRO SPOT CAKE CASES      ASSORTED COLOUR BIRD ORNAMENT 
##                               1466                               1457 
##        60 TEATIME FAIRY CAKE CASES           HOME BUILDING BLOCK WORD 
##                               1400                               1386 
##            JUMBO BAG RED RETROSPOT               LUNCH BAG RED SPOTTY 
##                               1310                               1274 
##       REX CASH+CARRY JUMBO SHOPPER             JUMBO STORAGE BAG SUKI 
##                               1232                               1220 
## PACK OF 60 PINK PAISLEY CAKE CASES         WOODEN FRAME ANTIQUE WHITE 
##                               1196                               1190 
##            LUNCH BAG  BLACK SKULL.       BAKING SET 9 PIECE RETROSPOT 
##                               1179                               1176 
##             LUNCH BAG SUKI  DESIGN              HEART OF WICKER LARGE 
##                               1157                               1151 
##           LOVE BUILDING BLOCK WORD   RED HANGING HEART T-LIGHT HOLDER 
##                               1142                               1129

Customer concentration

Most similar clients:

#' Compute Gower distance on December 2010 data only, since it is time-consuming on the full dataset
concentrationdf <- data %>%
  filter(month(invoice_date) == 12 & year(invoice_date) == 2010) %>%
  group_by(customer_id, country) %>%
  select(-invoice_date, -price, -quantity) %>%
  summarise(sum_total_price = sum(total_price))
gower_dist <- daisy(concentrationdf, metric = "gower")

gower_mat <- as.matrix(gower_dist)
#' Print most similar clients
concentrationdf[which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]), arr.ind = TRUE)[1, ], ]
## # A tibble: 2 x 3
## # Groups:   customer_id [2]
##   customer_id country        sum_total_price
##   <fct>       <fct>                    <dbl>
## 1 13402       United Kingdom            308.
## 2 13369       United Kingdom            308.

Most dissimilar clients:

#' Print most dissimilar clients
concentrationdf[which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]), arr.ind = TRUE)[1, ], ]
## # A tibble: 2 x 3
## # Groups:   customer_id [2]
##   customer_id country        sum_total_price
##   <fct>       <fct>                    <dbl>
## 1 (Missing)   United Kingdom         72316. 
## 2 12721       France                   -27.2
sil_width <- c(NA)
for(i in 2:15){
  pam_fit <- pam(gower_dist, diss = TRUE, k = i)
  sil_width[i] <- pam_fit$silinfo$avg.width
}

plot(1:15, sil_width,
     xlab = "Number of clusters",
     ylab = "Silhouette Width")
lines(1:15, sil_width)

Here we can see the results of the clustering:

k <- 10

pam_fit <- pam(gower_dist, diss = TRUE, k)


tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering))
ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = cluster))

We used Gower distance to find similar customers and found a clear pattern in one group; the remaining nine clusters are less distinct. The clustering needs more tuning, since some variables were left out due to hard-drive space limitations.
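To characterise each cluster, the medoids (PAM's most representative observations) can be inspected; a short sketch:

```r
# id.med holds the row indices of the medoid observations in concentrationdf
concentrationdf[pam_fit$id.med, ]
```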

Number of item orders by country

We have entries from 40 different countries.

countryfreq <- summary(data$country)
countries <- stack(countryfreq)

countries %>% rename(country = ind, n = values) %>% arrange(-n)
##         n              country
## 1  485852       United Kingdom
## 2    9670                 EIRE
## 3    8129              Germany
## 4    5772               France
## 5    2769          Netherlands
## 6    1278                Spain
## 7    1187          Switzerland
## 8    1101             Portugal
## 9    1054              Belgium
## 10    906      Channel Islands
## 11    902               Sweden
## 12    731                Italy
## 13    654            Australia
## 14    554               Cyprus
## 15    537              Austria
## 16    517               Greece
## 17    432 United Arab Emirates
## 18    428              Denmark
## 19    369               Norway
## 20    354              Finland
## 21    310          Unspecified
## 22    244                  USA
## 23    224                Japan
## 24    194               Poland
## 25    172                Malta
## 26    154            Lithuania
## 27    117            Singapore
## 28    111                  RSA
## 29    107              Bahrain
## 30     77               Canada
## 31     76            Hong Kong
## 32     76             Thailand
## 33     74               Israel
## 34     71              Iceland
## 35     63                Korea
## 36     62               Brazil
## 37     54          West Indies
## 38     34              Bermuda
## 39     32              Nigeria
## 40     13              Lebanon

Shopping dynamics of a particular day: 14th of November 2010

dynamics <- data %>%
  filter(day(invoice_date) == 14 & month(invoice_date) == 11 & year(invoice_date) == 2010) %>%
  select(-quantity)

#lags
dyna_tl <- mutate(dynamics, total_price = lag(total_price))
dyna_all <- bind_rows(old = dynamics, new = dyna_tl, .id = "source") %>%
    arrange(invoice_date, source)

ggplot(dynamics, aes(invoice_date, total_price)) +
  geom_step() +
  geom_ribbon( data = dyna_all, aes( ymin = 0, ymax = total_price ),
             fill="tomato", alpha=0.5 )

Customer loyalty in August 2010

This can be reproduced at a larger scale by increasing the minimum number of interactions a customer must have with the store.

loyaltysubset <- data %>%
  filter(month(invoice_date) == 8 & year(invoice_date) == 2010) %>%
  select(-stock_code, -description, -invoice, -quantity, -price)

# drop customers with fewer than 200 invoice lines in August 2010
tt <- table(loyaltysubset$customer_id)
rare_levels <- names(tt)[tt<200]
loyaltysubset <- subset(loyaltysubset,!customer_id %in% rare_levels)
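Note that subset() keeps the now-empty factor levels, which is why the summary shows countries with zero rows; dropping them is optional (the new variable name is just for illustration):

```r
# Drop unused factor levels left over after subsetting
loyaltysubset_clean <- droplevels(loyaltysubset)
```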


summary(loyaltysubset)
##   invoice_date                    customer_id             country    
##  Min.   :2010-08-01 14:23:00   (Missing):6364   United Kingdom:7720  
##  1st Qu.:2010-08-10 12:08:00   16549    : 507   EIRE          : 462  
##  Median :2010-08-13 18:07:00   17841    : 353   Hong Kong     :   2  
##  Mean   :2010-08-16 04:23:50   14911    : 284   Australia     :   0  
##  3rd Qu.:2010-08-24 14:29:00   14606    : 236   Austria       :   0  
##  Max.   :2010-08-31 17:31:00   13089    : 225   Bahrain       :   0  
##                                (Other)  : 215   (Other)       :   0  
##   total_price       
##  Min.   :-18910.69  
##  1st Qu.:     2.51  
##  Median :     5.06  
##  Mean   :    10.58  
##  3rd Qu.:    11.82  
##  Max.   :  2367.28  
## 
# store the result under a new name so the plot_timeline() function isn't masked
timeline_plot <- plot_timeline(loyaltysubset, save_path = "loyalty.png")
## INFO [2020-03-19 23:58:38] invoice_date has been selected as the timestamp column
## INFO [2020-03-19 23:58:38] total_price has been selected as the numeric column(s)
## INFO [2020-03-19 23:58:38] customer_id, country has been selected as the state column(s)
## INFO [2020-03-19 23:58:38] creating state plot layers
## INFO [2020-03-19 23:58:39] creating sample plot layers
## Aligning plots
## INFO [2020-03-19 23:58:39] Writing image to file as PNG in:
## Plotting
grid.draw(timeline_plot, recording = TRUE)

Timeline of loyalty subset


Predicting daily sales in January 2010 with x-day-ahead simple moving average forecasts.

jan2010 <- data %>%
  filter(month(invoice_date) == 1 & year(invoice_date) == 2010) %>%
  mutate(day = lubridate::day(invoice_date), year = year(invoice_date)) %>%
  group_by(day, year) %>%
  summarise(daily_sales = sum(total_price)) %>%
  ungroup()

jan2010 <- jan2010 %>%
  mutate(date = ymd(paste0(year, "-01-", day))) %>%
  select(daily_sales, date)

Plot of SMA

smooth::sma(jan2010$daily_sales, h = 5, interval = TRUE, silent = FALSE)

## Time elapsed: 0.02 seconds
## Model estimated: SMA(24)
## Initial values were produced using backcasting.
## 
## Loss function type: MSE; Loss function value: 183000530.6431
## Error standard deviation: 14129.2938
## Sample size: 24
## Number of estimated parameters: 2
## Number of degrees of freedom: 22
## Information criteria:
##      AIC     AICc      BIC     BICc 
## 528.7090 529.2805 531.0651 531.9732 
## 
## 95% parametric prediction interval was constructed
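Storing the fit makes the point forecasts and interval bounds accessible; a sketch, assuming the version of smooth used here exposes these components on the fitted object:

```r
# Fit again silently and pull out the 5-day-ahead forecasts
sma_fit <- smooth::sma(jan2010$daily_sales, h = 5, interval = TRUE, silent = TRUE)
sma_fit$forecast  # point forecasts
sma_fit$lower     # lower 95% interval bound
sma_fit$upper     # upper 95% interval bound
```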